AAAI.2024 - Cognitive Modeling and Cognitive Systems | Cool Papers

#1 Operationalizing Essential Characteristics of Creativity in a Computational System for Music Composition [PDF] [Copy] [Kimi³]

Authors: Paul M. Bodily ; Dan Ventura

We address the problem of building and evaluating a computational system whose primary objective is creativity. We illustrate seven characteristics for computational creativity in the context of a system that autonomously composes Western lyrical music. We conduct an external evaluation of the system in which respondents rated the system with regard to each characteristic as well as with regard to overall creativity. Average scores for overall creativity exceeded the ratings for any single characteristic, suggesting that creativity may be an emergent property and that unique research opportunities exist for building CC systems whose design attempts to comprehend all known characteristics of creativity.

#2 Neural Reasoning about Agents’ Goals, Preferences, and Actions [PDF] [Copy] [Kimi²]

Authors: Matteo Bortoletto ; Lei Shi ; Andreas Bulling

We propose the Intuitive Reasoning Network (IRENE) - a novel neural model for intuitive psychological reasoning about agents' goals, preferences, and actions that can generalise previous experiences to new situations. IRENE combines a graph neural network for learning agent and world state representations with a transformer to encode the task context. When evaluated on the challenging Baby Intuitions Benchmark, IRENE achieves new state-of-the-art performance on three out of its five tasks - with up to 48.9% improvement. In contrast to existing methods, IRENE is able to bind preferences to specific agents, to better distinguish between rational and irrational agents, and to better understand the role of blocking obstacles. We also investigate, for the first time, the influence of the training tasks on test performance. Our analyses demonstrate the effectiveness of IRENE in combining prior knowledge gained during training for unseen evaluation tasks.

#3 An Empirical Study of CLIP for Text-Based Person Search [PDF] [Copy] [Kimi⁴]

Authors: Min Cao ; Yang Bai ; Ziyin Zeng ; Mang Ye ; Min Zhang

Text-based Person Search (TBPS) aims to retrieve the person images using natural language descriptions. Recently, Contrastive Language Image Pretraining (CLIP), a universal large cross-modal vision-language pre-training model, has remarkably performed over various cross-modal downstream tasks due to its powerful cross-modal semantic learning capacity. TPBS, as a fine-grained cross-modal retrieval task, is also facing the rise of research on the CLIP-based TBPS. In order to explore the potential of the visual-language pre-training model for downstream TBPS tasks, this paper makes the first attempt to conduct a comprehensive empirical study of CLIP for TBPS and thus contribute a straightforward, incremental, yet strong TBPS-CLIP baseline to the TBPS community. We revisit critical design considerations under CLIP, including data augmentation and loss function. The model, with the aforementioned designs and practical training tricks, can attain satisfactory performance without any sophisticated modules. Also, we conduct the probing experiments of TBPS-CLIP in model generalization and model compression, demonstrating the effectiveness of TBPS-CLIP from various aspects. This work is expected to provide empirical insights and highlight future CLIP-based TBPS research.

#4 Social Physics Informed Diffusion Model for Crowd Simulation [PDF] [Copy] [Kimi]

Authors: Hongyi Chen ; Jingtao Ding ; Yong Li ; Yue Wang ; Xiao-Ping Zhang

Crowd simulation holds crucial applications in various domains, such as urban planning, architectural design, and traffic arrangement. In recent years, physics-informed machine learning methods have achieved state-of-the-art performance in crowd simulation but fail to model the heterogeneity and multi-modality of human movement comprehensively. In this paper, we propose a social physics-informed diffusion model named SPDiff to mitigate the above gap. SPDiff takes both the interactive and historical information of crowds in the current timeframe to reverse the diffusion process, thereby generating the distribution of pedestrian movement in the subsequent timeframe. Inspired by the well-known social physics model, i.e., Social Force, regarding crowd dynamics, we design a crowd interaction encoder to guide the denoising process and further enhance this module with the equivariant properties of crowd interactions. To mitigate error accumulation in long-term simulations, we propose a multi-frame rollout training algorithm for diffusion modeling. Experiments conducted on two real-world datasets demonstrate the superior performance of SPDiff in terms of both macroscopic and microscopic evaluation metrics. Code and appendix are available at https://github.com/tsinghua-fib-lab/SPDiff.

#5 Trend-Aware Supervision: On Learning Invariance for Semi-supervised Facial Action Unit Intensity Estimation [PDF] [Copy] [Kimi]

Authors: Yingjie Chen ; Jiarui Zhang ; Tao Wang ; Yun Liang

With the increasing need for facial behavior analysis, semi-supervised AU intensity estimation using only keyframe annotations has emerged as a practical and effective solution to relieve the burden of annotation. However, the lack of annotations makes the spurious correlation problem caused by AU co-occurrences and subject variation much more prominent, leading to non-robust intensity estimation that is entangled among AUs and biased among subjects. We observe that trend information inherent in keyframe annotations could act as extra supervision and raising the awareness of AU-specific facial appearance changing trends during training is the key to learning invariant AU-specific features. To this end, we propose Trend-AwareSupervision (TAS), which pursues three kinds of trend awareness, including intra-trend ranking awareness, intra-trend speed awareness, and inter-trend subject awareness. TAS alleviates the spurious correlation problem by raising trend awareness during training to learn AU-specific features that represent the corresponding facial appearance changes, to achieve intensity estimation invariance. Experiments conducted on two commonly used AU benchmark datasets, BP4D and DISFA, show the effectiveness of each kind of awareness. And under trend-aware supervision, the performance can be improved without extra computational or storage costs during inference.

#6 Enhancing the Robustness of Spiking Neural Networks with Stochastic Gating Mechanisms [PDF] [Copy] [Kimi]

Authors: Jianhao Ding ; Zhaofei Yu ; Tiejun Huang ; Jian K. Liu

Spiking neural networks (SNNs) exploit neural spikes to provide solutions for low-power intelligent applications on neuromorphic hardware. Although SNNs have high computational efficiency due to spiking communication, they still lack resistance to adversarial attacks and noise perturbations. In the brain, neuronal responses generally possess stochasticity induced by ion channels and synapses, while the role of stochasticity in computing tasks is poorly understood. Inspired by this, we elaborate a stochastic gating spiking neural model for layer-by-layer spike communication, introducing stochasticity to SNNs. Through theoretical analysis, our gating model can be viewed as a regularizer that prevents error amplification under attacks. Meanwhile, our work can explain the robustness of Poisson coding. Experimental results prove that our method can be used alone or with existing robust enhancement algorithms to improve SNN robustness and reduce SNN energy consumption. We hope our work will shed new light on the role of stochasticity in the computation of SNNs. Our code is available at https://github.com/DingJianhao/StoG-meets-SNN/.

#7 Imitation of Life: A Search Engine for Biologically Inspired Design [PDF] [Copy] [Kimi¹]

Authors: Hen Emuna ; Nadav Borenstein ; Xin Qian ; Hyeonsu Kang ; Joel Chan ; Aniket Kittur ; Dafna Shahaf

Biologically Inspired Design (BID), or Biomimicry, is a problem-solving methodology that applies analogies from nature to solve engineering challenges. For example, Speedo engineers designed swimsuits based on shark skin. Finding relevant biological solutions for real-world problems poses significant challenges, both due to the limited biological knowledge engineers and designers typically possess and to the limited BID resources. Existing BID datasets are hand-curated and small, and scaling them up requires costly human annotations. In this paper, we introduce BARcode (Biological Analogy Retriever), a search engine for automatically mining bio-inspirations from the web at scale. Using advances in natural language understanding and data programming, BARcode identifies potential inspirations for engineering challenges. Our experiments demonstrate that BARcode can retrieve inspirations that are valuable to engineers and designers tackling real-world problems, as well as recover famous historical BID examples. We release data and code; we view BARcode as a step towards addressing the challenges that have historically hindered the practical application of BID to engineering innovation.

#8 An Efficient Knowledge Transfer Strategy for Spiking Neural Networks from Static to Event Domain [PDF] [Copy] [Kimi¹]

Authors: Xiang He ; Dongcheng Zhao ; Yang Li ; Guobin Shen ; Qingqun Kong ; Yi Zeng

Spiking neural networks (SNNs) are rich in spatio-temporal dynamics and are suitable for processing event-based neuromorphic data. However, event-based datasets are usually less annotated than static datasets. This small data scale makes SNNs prone to overfitting and limits their performance. In order to improve the generalization ability of SNNs on event-based datasets, we use static images to assist SNN training on event data. In this paper, we first discuss the domain mismatch problem encountered when directly transferring networks trained on static datasets to event data. We argue that the inconsistency of feature distributions becomes a major factor hindering the effective transfer of knowledge from static images to event data. To address this problem, we propose solutions in terms of two aspects: feature distribution and training strategy. Firstly, we propose a knowledge transfer loss, which consists of domain alignment loss and spatio-temporal regularization. The domain alignment loss learns domain-invariant spatial features by reducing the marginal distribution distance between the static image and the event data. Spatio-temporal regularization provides dynamically learnable coefficients for domain alignment loss by using the output features of the event data at each time step as a regularization term. In addition, we propose a sliding training strategy, which gradually replaces static image inputs probabilistically with event data, resulting in a smoother and more stable training for the network. We validate our method on neuromorphic datasets, including N-Caltech101, CEP-DVS, and N-Omniglot. The experimental results show that our proposed method achieves better performance on all datasets compared to the current state-of-the-art methods. Code is available at https://github.com/Brain-Cog-Lab/Transfer-for-DVS.

#9 Responding to the Call: Exploring Automatic Music Composition Using a Knowledge-Enhanced Model [PDF] [Copy] [Kimi]

Authors: Zhejing Hu ; Yan Liu ; Gong Chen ; Xiao Ma ; Shenghua Zhong ; Qianwen Luo

Call-and-response is a musical technique that enriches the creativity of music, crafting coherent musical ideas that mirror the back-and-forth nature of human dialogue with distinct musical characteristics. Although this technique is integral to numerous musical compositions, it remains largely uncharted in automatic music composition. To enhance the creativity of machine-composed music, we first introduce the Call-Response Dataset (CRD) containing 19,155 annotated musical pairs and crafted comprehensive objective evaluation metrics for musical assessment. Then, we design a knowledge-enhanced learning-based method to bridge the gap between human and machine creativity. Specifically, we train the composition module using the call-response pairs, supplementing it with musical knowledge in terms of rhythm, melody, and harmony. Our experimental results underscore that our proposed model adeptly produces a wide variety of creative responses for various musical calls.

#10 Neural Amortized Inference for Nested Multi-Agent Reasoning [PDF] [Copy] [Kimi]

Authors: Kunal Jha ; Tuan Anh Le ; Chuanyang Jin ; Yen-Ling Kuo ; Joshua B. Tenenbaum ; Tianmin Shu

Multi-agent interactions, such as communication, teaching, and bluffing, often rely on higher-order social inference, i.e., understanding how others infer oneself. Such intricate reasoning can be effectively modeled through nested multi-agent reasoning. Nonetheless, the computational complexity escalates exponentially with each level of reasoning, posing a significant challenge. However, humans effortlessly perform complex social inferences as part of their daily lives. To bridge the gap between human-like inference capabilities and computational limitations, we propose a novel approach: leveraging neural networks to amortize high-order social inference, thereby expediting nested multi-agent reasoning. We evaluate our method in two challenging multi-agent interaction domains. The experimental results demonstrate that our method is computationally efficient while exhibiting minimal degradation in accuracy.

#11 Hidden Follower Detection: How Is the Gaze-Spacing Pattern Embodied in Frequency Domain? [PDF] [Copy] [Kimi]

Authors: Shu Li ; Ruimin Hu ; Suhui Li ; Liang Liao

Spatiotemporal social behavior analysis is a technique that studies the social behavior patterns of objects and estimates their risks based on their trajectories. In social public scenarios such as train stations, hidden following behavior has become one of the most challenging issues due to its probability of evolving into violent events, which is more than 25%. In recent years, research on hidden following detection (HFD) has focused on differences in time series between hidden followers and normal pedestrians under two temporal characteristics: gaze and spatial distance. However, the time-domain representation for time series is irreversible and usually causes the loss of critical information. In this paper, we deeply study the expression efficiency of time/frequency domain features of time series, by exploring the recovery mechanism of features to source time series, we establish a fidelity estimation method for feature expression and a selection model for frequency-domain features based on the signal-to-distortion ratio (SDR). Experimental results demonstrate the feature fidelity of time series and HFD performance are positively correlated, and the fidelity of frequency-domain features and HFD performance are significantly better than the time-domain features. On both real and simulated datasets, the accuracy of the proposed method is increased by 3%, and the gaze-only module is improved by 10%. Related research has explored new methods for optimal feature selection based on fidelity, new patterns for efficient feature expression of hidden following behavior, and the mechanism of multimodal collaborative identification.

#12 Music Style Transfer with Time-Varying Inversion of Diffusion Models [PDF] [Copy] [Kimi]

Authors: Sifei Li ; Yuxin Zhang ; Fan Tang ; Chongyang Ma ; Weiming Dong ; Changsheng Xu

With the development of diffusion models, text-guided image style transfer has demonstrated great controllable and high-quality results. However, the utilization of text for diverse music style transfer poses significant challenges, primarily due to the limited availability of matched audio-text datasets. Music, being an abstract and complex art form, exhibits variations and intricacies even within the same genre, thereby making accurate textual descriptions challenging. This paper presents a music style transfer approach that effectively captures musical attributes using minimal data. We introduce a novel time-varying textual inversion module to precisely capture mel-spectrogram features at different levels. During inference, we utilize a bias-reduced stylization technique to get stable results. Experimental results demonstrate that our method can transfer the style of specific instruments, as well as incorporate natural sounds to compose melodies. Samples and code are available at https://lsfhuihuiff.github.io/MusicTI/.

#13 A Brain-Inspired Way of Reducing the Network Complexity via Concept-Regularized Coding for Emotion Recognition [PDF] [Copy] [Kimi]

Authors: Han Lu ; Xiahai Zhuang ; Qiang Luo

The human brain can effortlessly and reliably perceive emotions, whereas existing facial emotion recognition (FER) methods suffer from drawbacks such as complex model structures, high storage requirements, and poor interpretability. Inspired by the role of emotion concepts in visual perception coding within the human brain, we propose a dual-pathway framework emulating the neural computation of emotion recognition. Specifically, these two pathways are designed to model the representation of emotion concepts in the brain and the visual perception process, respectively. For the former, we adopt a disentangled approach to extract emotion concepts from complex facial geometric attributes; for the latter, we employ an emotional confidence evaluation strategy to determine which concept is optimal for regularizing the perceptual coding. The proposed concept-regularized coding strategy endows the framework with flexibility and interpretability as well as good performances on several benchmarking FER datasets.

#14 Multi-Energy Guided Image Translation with Stochastic Differential Equations for Near-Infrared Facial Expression Recognition [PDF] [Copy] [Kimi]

Authors: Bingjun Luo ; Zewen Wang ; Jinpeng Wang ; Junjie Zhu ; Xibin Zhao ; Yue Gao

Illumination variation has been a long-term challenge in real-world facial expression recognition (FER). Under uncontrolled or non-visible light conditions, near-infrared (NIR) can provide a simple and alternative solution to obtain high-quality images and supplement the geometric and texture details that are missing in the visible (VIS) domain. Due to the lack of large-scale NIR facial expression datasets, directly extending VIS FER methods to the NIR spectrum may be ineffective. Additionally, previous heterogeneous image synthesis methods are restricted by low controllability without prior task knowledge. To tackle these issues, we present the first approach, called for NIR-FER Stochastic Differential Equations (NFER-SDE), that transforms face expression appearance between heterogeneous modalities to the overfitting problem on small-scale NIR data. NFER-SDE can take the whole VIS source image as input and, together with domain-specific knowledge, guide the preservation of modality-invariant information in the high-frequency content of the image. Extensive experiments and ablation studies show that NFER-SDE significantly improves the performance of NIR FER and achieves state-of-the-art results on the only two available NIR FER datasets, Oulu-CASIA and Large-HFE.

#15 Successive POI Recommendation via Brain-Inspired Spatiotemporal Aware Representation [PDF] [Copy] [Kimi]

Authors: Gehua Ma ; He Wang ; Jingyuan Zhao ; Rui Yan ; Huajin Tang

Existing approaches usually perform spatiotemporal representation in the spatial and temporal dimensions, respectively, which isolates the spatial and temporal natures of the target and leads to sub-optimal embeddings. Neuroscience research has shown that the mammalian brain entorhinal-hippocampal system provides efficient graph representations for general knowledge. Moreover, entorhinal grid cells present concise spatial representations, while hippocampal place cells represent perception conjunctions effectively. Thus, the entorhinal-hippocampal system provides a novel angle for spatiotemporal representation, which inspires us to propose the SpatioTemporal aware Embedding framework (STE) and apply it to POIs (STEP). STEP considers two types of POI-specific representations: sequential representation and spatiotemporal conjunctive representation, learned using sparse unlabeled data based on the proposed graph-building policies. Notably, STEP jointly represents the spatiotemporal natures of POIs using both observations and contextual information from integrated spatiotemporal dimensions by constructing a spatiotemporal context graph. Furthermore, we introduce a successive POI recommendation method using STEP, which achieves state-of-the-art performance on two benchmarks. In addition, we demonstrate the excellent performance of the STE representation approach in other spatiotemporal representation-centered tasks through a case study of the traffic flow prediction problem. Therefore, this work provides a novel solution to spatiotemporal representation and paves a new way for spatiotemporal modeling-related tasks.

#16 BDIQA: A New Dataset for Video Question Answering to Explore Cognitive Reasoning through Theory of Mind [PDF] [Copy] [Kimi¹]

Authors: Yuanyuan Mao ; Xin Lin ; Qin Ni ; Liang He

As a foundational component of cognitive intelligence, theory of mind (ToM) can make AI more closely resemble human thought processes, thereby enhancing their interaction and collaboration with human. In particular, it can significantly improve a model's comprehension of videos in complex scenes. However, current video question answer (VideoQA) datasets focus on studying causal reasoning within events, few of them genuinely incorporating human ToM. Consequently, there is a lack of development in ToM reasoning tasks within the area of VideoQA. This paper presents BDIQA, the first benchmark to explore the cognitive reasoning capabilities of VideoQA models in the context of ToM. BDIQA is inspired by the cognitive development of children's ToM and addresses the current deficiencies in machine ToM within datasets and tasks. Specifically, it offers tasks at two difficulty levels, assessing Belief, Desire and Intention (BDI) reasoning in both simple and complex scenarios. We conduct evaluations on several mainstream methods of VideoQA and diagnose their capabilities with zero-shot, few-shot and supervised learning. We find that the performance of pre-trained models on cognitive reasoning tasks remains unsatisfactory. To counter this challenge, we undertake thorough analysis and experimentation, ultimately presenting two guidelines to enhance cognitive reasoning derived from ablation analysis.

#17 Unveiling the Significance of Toddler-Inspired Reward Transition in Goal-Oriented Reinforcement Learning [PDF] [Copy] [Kimi]

Authors: Junseok Park ; Yoonsung Kim ; Hee bin Yoo ; Min Whoo Lee ; Kibeom Kim ; Won-Seok Choi ; Minsu Lee ; Byoung-Tak Zhang

Toddlers evolve from free exploration with sparse feedback to exploiting prior experiences for goal-directed learning with denser rewards. Drawing inspiration from this Toddler-Inspired Reward Transition, we set out to explore the implications of varying reward transitions when incorporated into Reinforcement Learning (RL) tasks. Central to our inquiry is the transition from sparse to potential-based dense rewards, which share optimal strategies regardless of reward changes. Through various experiments, including those in egocentric navigation and robotic arm manipulation tasks, we found that proper reward transitions significantly influence sample efficiency and success rates. Of particular note is the efficacy of the toddler-inspired Sparse-to-Dense (S2D) transition. Beyond these performance metrics, using Cross-Density Visualizer technique, we observed that transitions, especially the S2D, smooth the policy loss landscape, promoting wide minima that enhance generalization in RL models.

#18 Gated Attention Coding for Training High-Performance and Efficient Spiking Neural Networks [PDF] [Copy] [Kimi]

Authors: Xuerui Qiu ; Rui-Jie Zhu ; Yuhong Chou ; Zhaorui Wang ; Liang-Jian Deng ; Guoqi Li

Spiking neural networks (SNNs) are emerging as an energy-efficient alternative to traditional artificial neural networks (ANNs) due to their unique spike-based event-driven nature. Coding is crucial in SNNs as it converts external input stimuli into spatio-temporal feature sequences. However, most existing deep SNNs rely on direct coding that generates powerless spike representation and lacks the temporal dynamics inherent in human vision. Hence, we introduce Gated Attention Coding (GAC), a plug-and-play module that leverages the multi-dimensional gated attention unit to efficiently encode inputs into powerful representations before feeding them into the SNN architecture. GAC functions as a preprocessing layer that does not disrupt the spike-driven nature of the SNN, making it amenable to efficient neuromorphic hardware implementation with minimal modifications. Through an observer model theoretical analysis, we demonstrate GAC's attention mechanism improves temporal dynamics and coding efficiency. Experiments on CIFAR10/100 and ImageNet datasets demonstrate that GAC achieves state-of-the-art accuracy with remarkable efficiency. Notably, we improve top-1 accuracy by 3.10% on CIFAR100 with only 6-time steps and 1.07% on ImageNet while reducing energy usage to 66.9% of the previous works. To our best knowledge, it is the first time to explore the attention-based dynamic coding scheme in deep SNNs, with exceptional effectiveness and efficiency on large-scale datasets. Code is available at https://github.com/bollossom/GAC.

#19 Efficient Spiking Neural Networks with Sparse Selective Activation for Continual Learning [PDF] [Copy] [Kimi]

Authors: Jiangrong Shen ; Wenyao Ni ; Qi Xu ; Huajin Tang

The next generation of machine intelligence requires the capability of continual learning to acquire new knowledge without forgetting the old one while conserving limited computing resources. Spiking neural networks (SNNs), compared to artificial neural networks (ANNs), have more characteristics that align with biological neurons, which may be helpful as a potential gating function for knowledge maintenance in neural networks. Inspired by the selective sparse activation principle of context gating in biological systems, we present a novel SNN model with selective activation to achieve continual learning. The trace-based K-Winner-Take-All (K-WTA) and variable threshold components are designed to form the sparsity in selective activation in spatial and temporal dimensions of spiking neurons, which promotes the subpopulation of neuron activation to perform specific tasks. As a result, continual learning can be maintained by routing different tasks via different populations of neurons in the network. The experiments are conducted on MNIST and CIFAR10 datasets under the class incremental setting. The results show that the proposed SNN model achieves competitive performance similar to and even surpasses the other regularization-based methods deployed under traditional ANNs.

#20 Boosting Neural Cognitive Diagnosis with Student’s Affective State Modeling [PDF] [Copy] [Kimi]

Authors: Shanshan Wang ; Zhen Zeng ; Xun Yang ; Ke Xu ; Xingyi Zhang

Cognitive Diagnosis Modeling aims to infer students' proficiency level on knowledge concepts from their response logs. Existing methods typically model students’ response processes as the interaction between students and exercises or concepts based on hand-crafted or deeply-learned interaction functions. Despite their promising achievements, they fail to consider the relationship between students' cognitive states and affective states in learning, e.g., the feelings of frustration, boredom, or confusion with the learning content, which is insufficient for comprehensive cognitive diagnosis in intelligent education. To fill the research gap, we propose a novel Affect-aware Cognitive Diagnosis (ACD) model which can effectively diagnose the knowledge proficiency levels of students by taking into consideration the affective factors. Specifically, we first design a student affect perception module under the assumption that the affective state is jointly influenced by the student's affect trait and the difficulty of the exercise. Then, our inferred affective distribution is further used to estimate the student's subjective factors, i.e., guessing and slipping, respectively. Finally, we integrate the estimated guessing and slipping parameters with the basic neural cognitive diagnosis framework based on the DINA model, which facilitates the modeling of complex exercising interactions in a more accurate and interpretable fashion. Besides, we also extend our affect perception module in an unsupervised learning setting based on contrastive learning, thus significantly improving the compatibility of our ACD. To the best of our knowledge, we are the first to unify the cognition modeling and affect modeling into the same framework for student cognitive diagnosis. Extensive experiments on real-world datasets clearly demonstrate the effectiveness of our ACD. Our code is available at https://github.com/zeng-zhen/ACD.

#21 DMMR: Cross-Subject Domain Generalization for EEG-Based Emotion Recognition via Denoising Mixed Mutual Reconstruction [PDF] [Copy] [Kimi]

Authors: Yiming Wang ; Bin Zhang ; Yujiao Tang

Electroencephalography (EEG) has proven to be effective in emotion analysis. However, current methods struggle with individual variations, complicating the generalization of models trained on data from source subjects to unseen target subjects. To tackle this issue, we propose the Denoising Mixed Mutual Reconstruction (DMMR) model, employing a two-stage pre-training followed by fine-tuning approach. During the pre-training phase, DMMR leverages self-supervised learning through a multi-decoder autoencoder, which encodes and reconstructs features of one subject, aiming to generate features resembling those from other subjects within the same category, thereby encouraging the encoder to learn subject-invariant features. We introduce a hidden-layer mixed data augmentation approach to mitigate the limitations posed by the scarcity of source data, thereby extending the method to a two-stage process. To bolster stability against noise, we incorporate a noise injection method, named “Time Steps Shuffling”, into the input data. During the fine-tuning phase, an emotion classifier is integrated to extract emotion-related features. Experimental accuracy on the SEED and SEED-IV datasets reached 88.27% (±5.62) and 72.70% (±8.01), respectively, demonstrating state-of-the-art and comparable performance, thereby showcasing the superiority of DMMR. The proposed data augmentation and noise injection methods were observed to complementarily enhance accuracy and stability, thus alleviating the aforementioned issues.

#22 Transient Glimpses: Unveiling Occluded Backgrounds through the Spike Camera [PDF] [Copy] [Kimi]

Authors: Jiyuan Zhang ; Shiyan Chen ; Yajing Zheng ; Zhaofei Yu ; Tiejun Huang

The de-occlusion problem, involving extracting clear background images by removing foreground occlusions, holds significant practical importance but poses considerable challenges. Most current research predominantly focuses on generating discrete images from calibrated camera arrays, but this approach often struggles with dense occlusions and fast motions due to limited perspectives and motion blur. To overcome these limitations, an effective solution requires the integration of multi-view visual information. The spike camera, as an innovative neuromorphic sensor, shows promise with its ultra-high temporal resolution and dynamic range. In this study, we propose a novel approach that utilizes a single spike camera for continuous multi-view imaging to address occlusion removal. By rapidly moving the spike camera, we capture a dense stream of spikes from occluded scenes. Our model, SpkOccNet, processes these spikes by integrating multi-view spatial-temporal information via long-short-window feature extractor (LSW) and employs a novel cross-view mutual attention-based module (CVA) for effective fusion and refinement. Additionally, to facilitate research in occlusion removal, we introduce the S-OCC dataset, which consists of real-world spike-based data. Experimental results demonstrate the efficiency and generalization capabilities of our model in effectively removing dense occlusions across diverse scenes. Public project page: https://github.com/Leozhangjiyuan/SpikeDeOcclusion.

#23 Open-Set Facial Expression Recognition [PDF] [Copy] [Kimi]

Authors: Yuhang Zhang ; Yue Yao ; Xuannan Liu ; Lixiong Qin ; Wenjing Wang ; Weihong Deng

Facial expression recognition (FER) models are typically trained on datasets with a fixed number of seven basic classes. However, recent research works (Cowen et al. 2021; Bryant et al. 2022; Kollias 2023) point out that there are far more expressions than the basic ones. Thus, when these models are deployed in the real world, they may encounter unknown classes, such as compound expressions that cannot be classified into existing basic classes. To address this issue, we propose the open-set FER task for the first time. Though there are many existing open-set recognition methods, we argue that they do not work well for open-set FER because FER data are all human faces with very small inter-class distances, which makes the open-set samples very similar to close-set samples. In this paper, we are the first to transform the disadvantage of small inter-class distance into an advantage by proposing a new way for open-set FER. Specifically, we find that small inter-class distance allows for sparsely distributed pseudo labels of open-set samples, which can be viewed as symmetric noisy labels. Based on this novel observation, we convert the open-set FER to a noisy label detection problem. We further propose a novel method that incorporates attention map consistency and cycle training to detect the open-set samples. Extensive experiments on various FER datasets demonstrate that our method clearly outperforms state-of-the-art open-set recognition methods by large margins. Code is available at https://github.com/zyh-uaiaaaa.

#24 Bootstrapping Cognitive Agents with a Large Language Model [PDF] [Copy] [Kimi]

Authors: Feiyu Zhu ; Reid Simmons

Large language models contain noisy general knowledge of the world, yet are hard to train or fine-tune. In contrast cognitive architectures have excellent interpretability and are flexible to update but require a lot of manual work to instantiate. In this work, we combine the best of both worlds: bootstrapping a cognitive-based model with the noisy knowledge encoded in large language models. Through an embodied agent doing kitchen tasks, we show that our proposed framework yields better efficiency compared to an agent entirely based on large language models. Our experiments also indicate that the cognitive agent bootstrapped using this framework can generalize to novel environments and be scaled to complex tasks.

#25 Data Augmented Graph Neural Networks for Personality Detection [PDF] [Copy] [Kimi]

Authors: Yangfu Zhu ; Yue Xia ; Meiling Li ; Tingting Zhang ; Bin Wu

Personality detection is a fundamental task for user psychology research. One of the biggest challenges in personality detection lies in the quantitative limitation of labeled data collected by completing the personality questionnaire, which is very time-consuming and labor-intensive. Most of the existing works are mainly devoted to learning the rich representations of posts based on labeled data. However, they still suffer from the inherent weakness of the amount limitation of labels, which potentially restricts the capability of the model to deal with unseen data. In this paper, we construct a heterogeneous personality graph for each labeled and unlabeled user and develop a novel psycholinguistic augmented graph neural network to detect personality in a semi-supervised manner, namely Semi-PerGCN. Specifically, our model first explores a supervised Personality Graph Neural Network (PGNN) to refine labeled user representation on the heterogeneous graph. For the remaining massive unlabeled users, we utilize the empirical psychological knowledge of the Linguistic Inquiry and Word Count (LIWC) lexicon for multi-view graph augmentation and perform unsupervised graph consistent constraints on the parameters shared PGNN. During the learning process of finite labeled users, noise-invariant learning on a large scale of unlabeled users is combined to enhance the generalization ability. Extensive experiments on three real-world datasets, Youtube, PAN2015, and MyPersonality demonstrate the effectiveness of our Semi-PerGCN in personality detection, especially in scenarios with limited labeled users.